
Add AnyFlow Any-Step Video Diffusion Pipelines (Bidirectional + FAR Causal)#13745

Merged
dg845 merged 42 commits into huggingface:main from Enderfga:add-anyflow-pipeline
May 22, 2026

Conversation

@Enderfga
Contributor

@Enderfga Enderfga commented May 14, 2026

What does this PR do?

This PR adds pipelines for AnyFlow (paper, project page, official code, model weights), an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16, 32 NFE without retraining, and quality scales monotonically with steps — unlike consistency-based distillation, which often degrades as NFE grows.

Two new pipelines are added, both on top of a new FlowMapEulerDiscreteScheduler and reusing WanLoraLoaderMixin:

  • AnyFlowPipeline + AnyFlowTransformer3DModel: bidirectional text-to-video built on the Wan2.1 backbone with an AnyFlowDualTimestepTextImageEmbedding conditioning on the source/target timestep pair (t, r).
  • AnyFlowFARPipeline + AnyFlowFARTransformer3DModel: frame-level autoregressive variant (block-sparse causal flex_attention + KV cache + compressed-frame patch embedding) jointly handling T2V / I2V / V2V through one context_sequence argument.

Four checkpoints are released under the nvidia/anyflow collection (Wan2.1-T2V-{1.3B,14B} bidi + FAR-Wan2.1-{1.3B,14B} causal). All four have been validated bit-exact against the official NVlabs/AnyFlow reference on H200: forward L2 = 0.00e+00 for scheduler / transformer / bidi pipeline / FAR pipeline; backward grad delta is 4.88e-04, attributable to bf16 kernel non-determinism only (PR-vs-PR = PR-vs-reference, ratio 1.000); inference latency matches the reference at ±0.0% on both pipelines.

T2V inference example:

import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "anyflow_t2v.mp4", fps=16)

I2V inference example with the FAR pipeline (single conditioning frame → autoregressive rollout):

import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

first_frame = load_image("path/to/first_frame.png").resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0
context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(2).to("cuda")

video = pipe(
    prompt="a cat walks across a sunlit lawn",
    context_sequence={"raw": context},
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "anyflow_i2v.mp4", fps=16)

Documentation: EN tutorial at docs/source/en/using-diffusers/anyflow.md, ZH tutorial at docs/source/zh/using-diffusers/anyflow.md, and three API pages (pipelines + two transformer model pages). Tests: 22 fast tests (transformer + scheduler, CPU) plus four pipeline test files, with slow integration tests gated on RUN_SLOW=1 @require_torch_accelerator for the released checkpoints.

anyflow-pr-presentation.mp4

Before submitting

Who can review?

@yiyixuxu @asomoza

Enderfga added 15 commits May 6, 2026 14:41
…vel imports

This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py,
pipeline_anyflow_causal.py, transformer_anyflow.py,
scheduling_flow_map_euler_discrete.py) come in subsequent commits.
The flow-map scheduler advances samples from timestep t to caller-provided
target r in a single Euler step, supporting any-step sampling on flow-map-
distilled checkpoints. It is a general-purpose scheduler — not specific to the
AnyFlow checkpoints.

Tests: 12 standalone tests covering instantiation, set_timesteps endpoints,
shift identity/monotonicity, step shape preservation, zero-interval identity,
one-shot sampling, train weight schemes, scale_noise endpoints.

Docs: api/schedulers/flow_map_euler_discrete.md
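
As a rough illustration of the update this commit message describes, the sketch below advances a sample from time t to a caller-provided target r in one Euler step. It assumes the model predicts an average velocity over the interval and that times run from 1 (noise) to 0 (data) along a linear path; `flow_map_euler_step` and the list-based "tensors" are hypothetical and are not the scheduler's actual API.

```python
def flow_map_euler_step(sample, mean_velocity, t, r):
    """Jump from time t to time r in a single Euler step.

    Assumes `mean_velocity` approximates the average velocity over the
    interval between t and r (hypothetical helper, plain lists as tensors).
    """
    dt = r - t  # negative when denoising from t=1 (noise) toward r=0 (data)
    return [x + dt * v for x, v in zip(sample, mean_velocity)]


# A linear path x_t = (1 - t) * data + t * noise has constant velocity noise - data.
data = [0.0, 2.0]
noise = [1.0, -1.0]
velocity = [n - d for n, d in zip(noise, data)]

one_shot = flow_map_euler_step(noise, velocity, t=1.0, r=0.0)   # one-shot sampling
unchanged = flow_map_euler_step(noise, velocity, t=0.5, r=0.5)  # zero-interval identity
```

The two calls mirror the "one-shot sampling" and "zero-interval identity" cases listed in the test summary: with an exact average velocity a single step lands on the data, and a step with r equal to t leaves the sample untouched.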
A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules:
* FAR causal blocks (init_far_model=True): block-sparse causal attention via
  flex_attention + compressed-frame patch embedding for frame-level
  autoregressive generation (Gu et al., 2025, arXiv:2503.19325).
* Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta
  timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary
  intervals (AnyFlow).

With both flags off, the model reduces to stock Wan2.1.

The class is intentionally self-contained rather than annotated with
'# Copied from diffusers.models.transformers.transformer_wan' because upstream
Wan has been refactored extensively since v0.35.1 (new WanAttention class,
different processor architecture).

Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and
determinism, return_dict variants, save/load round-trip with and without
init_far_model, gradient checkpointing toggle.

Docs: api/models/anyflow_transformer3d.md
* AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using
  flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}.
* AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based
  causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints
  from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers.

Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel,
and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel
introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler.

Tests:
* tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests +
  slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers.
* tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant.

Reference slices for slow integration tests are deferred to Phase 7
(Final quality pass) where the user runs them on a real GPU.
Modeled on the Helios pipeline doc (PR huggingface#13208). Sections: paper link + abstract,
supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V
examples for both bidirectional and causal variants, autodoc trailers.
…ersion script

* Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING.
* AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because
  its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key.
* scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints
  (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all
  4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the
  upstream repo with argparse to match other diffusers conversion scripts.
* ruff format pass on all 5 source files (long lines + trailing comma fixes)
* check_dummies.py --fix_and_overwrite regenerated:
  - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler
  - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline

Local fast tests: 21/21 passed
  - 12 scheduler tests (FlowMapEulerDiscreteScheduler)
  - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load)

The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install
that matches the diffusers main branch's transformers >= compatibility floor.
The reference slices for slow integration tests (real GPU + 1.3B/14B
checkpoints) are intentionally left as TODO stubs to be captured by the user
on a real GPU machine before opening the PR.
…torials

Critical bug fixes (verified against precision-validation review):
* pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded
  transformer_dtype = torch.bfloat16 with self.transformer.dtype, so
  pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a
  dtype mismatch in the patch_embedding conv3d.
* transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in
  _build_causal_mask (was a copy-paste typo carried over from FAR-Dev).
* transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals
  and the `# noqa: F841` markers that were silencing the dead-store warning.
* transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the
  pipeline manages KV cache directly, the mixin's interface is unused.
* transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)`
  with try/except so the file imports cleanly on CPU CI / no-Triton machines.
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the
  stdlib logger (warning_once-style) and a module-level basicConfig.

Documentation accuracy:
* AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial:
  drop the fictitious `task_type` / `image` / `video` arguments and document
  the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`)
  to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes.
* Pipeline class docstrings + main doc: explicitly describe AnyFlow's
  two-stage LoRA distillation including DMD reverse-divergence supervision
  with Flow-Map backward simulation in stage 2 (was previously implicit).
* training_rollout: add detailed docstring explaining its role as the
  3-segment Flow-Map backward simulation entry point used during DMD training.
* Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and
  Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added
  and registered in both `_toctree.yml` files.
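
The mode switch described in the first bullet can be sketched as a small decision function. This is only an illustration of the rule stated above (no context, one frame, or 4n+1 frames); `infer_far_mode` is a hypothetical helper, and the real pipeline may derive the mode differently from the `context_sequence` tensor.

```python
def infer_far_mode(num_context_frames=None):
    """Map the number of raw context frames to a generation mode (hypothetical helper)."""
    if num_context_frames is None:
        return "t2v"  # no context_sequence: text-to-video
    if num_context_frames == 1:
        return "i2v"  # single conditioning frame
    if num_context_frames > 1 and (num_context_frames - 1) % 4 == 0:
        return "tv2v"  # 4n+1 frames of video context
    raise ValueError(f"expected 1 or 4n+1 context frames, got {num_context_frames}")
```

For example, `infer_far_mode(5)` selects the video-context mode, while `infer_far_mode(4)` is rejected because 4 is not of the form 4n+1.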

Tests:
* Skip `test_attention_slicing_forward_pass` in both pipeline test classes
  with a clear rationale (custom attention processor does not support slicing).
* All 21 standalone tests still pass (12 scheduler + 9 transformer).

Quality gates:
* `ruff check` clean across all AnyFlow files.
* `ruff format --check` reports 6 files already formatted.
* `python utils/check_copies.py` reports no diff.

Out of scope for this commit (deferred until reviewer feedback):
* Splitting AnyFlowTransformer3DModel into bidi + causal subclasses
* Unifying _forward_inference / _forward_cache return types
* Migrating model tests from plain unittest to BaseModelTesterConfig + mixins
* HF model card / config.json metadata updates on the nvidia/* repos
  (push to Hub manually before opening the PR)
… output

Round 2 of review feedback. Three groups of changes; transformer state-dict
keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact
validation remains valid.

A. Pipeline rename (mechanical, no behavior change):
   * Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers
     usually means an attention mask; AnyFlow's variant is FAR autoregressive,
     so the FAR name is more specific and matches the paper).
   * File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv).
   * Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv).
   * All references updated in src/, tests/, docs/, scripts/, plus stale
     anyflowcausalpipeline anchor links in tutorial markdown.

B. Pipeline test bug fixes (closes 19 fast-test failures reported by
   precision-validation reviewer):
   * pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets
     self._num_timesteps = num_inference_steps before the rollout, so the
     PipelineTesterMixin callback tests can read pipe.num_timesteps.
   * tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious
     task_type="t2v" kwarg that crashed every causal fast test (the FAR
     pipeline selects mode via context_sequence, not a task_type arg).

C. Transformer architecture cleanups (review-driven, no tensor changes):
   * Replace forward(*args, **kwargs) dispatcher with an explicit signature
     listing every supported kwarg (hidden_states, timestep, r_timestep,
     encoder_hidden_states, encoder_hidden_states_image, chunk_partition,
     clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal,
     attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile
     tracing.
   * Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput
     (BaseOutput dataclass with sample + kv_cache fields) for the two causal
     paths that need to also propagate kv_cache (_forward_inference and the
     newly return_dict-aware _forward_cache). _forward_train and
     _forward_bidirection now consistently return Transformer2DModelOutput.
     Pipeline call sites already use return_dict=False with positional
     unpacking, so the fix is transparent there.

Out of scope (deferred until canonical-org HF metadata sync):
   * Splitting AnyFlowTransformer3DModel into a bidi class plus an
     AnyFlowFARTransformer3DModel subclass — touches register_to_config keys
     and would require updating model_index.json on every released checkpoint.
   * Promoting chunk_partition from register_to_config to a forward-time
     argument (same reason).
   * Renaming training_rollout to _denoise — would break callers in the
     FAR-Dev on-policy trainer that produced the released checkpoints.

Local fast tests: 21/21 still pass (12 scheduler + 9 transformer).
ruff check, ruff format, and check_copies.py are all clean.
…nk_partition to FAR fast-test fixture

Two root causes for the 19 remaining PipelineTesterMixin failures, identified
by the H200 reviewer:

1. callback_on_step_end was accepted by __call__ but never invoked. Both
   pipelines pass it through to training_rollout (and FAR additionally through
   inference()), and inference_range now fires it after scheduler.step in
   the standard inference branch:

       if callback_on_step_end is not None:
           callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs}
           callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
           latents = callback_outputs.pop("latents", latents)
           prompt_embeds = ...
           negative_prompt_embeds = ...

   `nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite
   the closure-captured embeddings, matching upstream WanPipeline semantics.
   The 3-segment grad_timestep training rollout does not invoke the callback;
   it is intentionally training-only.

2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built
   the dummy transformer without a `chunk_partition`, leaving it None on the
   model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`.
   Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame
   each, matching the test's num_frames=9 -> 3 latent frames).
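
A minimal, self-contained sketch of the callback contract from item 1, using a toy loop in place of the real denoising rollout. Only the `callback_on_step_end(pipe, i, t, callback_kwargs)` call shape is taken from the snippet above; `toy_denoise_loop`, `zero_on_last_step`, and the halving "scheduler step" are hypothetical stand-ins.

```python
def toy_denoise_loop(latents, timesteps, callback_on_step_end=None):
    """Stand-in for the inference branch: one fake scheduler step per timestep."""
    for i, t in enumerate(timesteps):
        latents = [0.5 * x for x in latents]  # placeholder for scheduler.step(...)
        if callback_on_step_end is not None:
            callback_kwargs = {"latents": latents}
            callback_outputs = callback_on_step_end(None, i, t, callback_kwargs)
            # The callback may return replacement tensors; otherwise keep the current ones.
            latents = callback_outputs.pop("latents", latents)
    return latents


seen_steps = []


def zero_on_last_step(pipe, step_index, timestep, callback_kwargs):
    """Record each invocation and zero the latents on the final step."""
    seen_steps.append(step_index)
    if step_index == 3:
        callback_kwargs["latents"] = [0.0 for _ in callback_kwargs["latents"]]
    return callback_kwargs


result = toy_denoise_loop(
    [8.0, 4.0], timesteps=[1.0, 0.75, 0.5, 0.25], callback_on_step_end=zero_on_last_step
)
```

The callback fires once per step after the (fake) scheduler update, and whatever it returns under `"latents"` replaces the loop's working tensor.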

Local fast tests: 21/21 still pass.
ruff check, ruff format, and check_copies.py are all clean.
…ig + rename helpers

Major architectural refactor that aligns the integration with diffusers conventions
ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and
tensor flow are unchanged so the H200 bit-exact validation remains valid; only
the on-disk transformer/config.json fields move.

Changes:

1. **Sibling transformer classes** replace the flag-driven single class:
   * AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size /
     full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition
     kwargs (always-on for AnyFlow distilled checkpoints).
   * AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward
     paths (train / cache-prefill / autoregressive inference).
   * AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by
     the old setup_flowmap_model bootstrap) is removed; both classes now build
     AnyFlowDualTimestepTextImageEmbedding directly in __init__.
   * setup_flowmap_model / setup_far_model methods are removed; weight warm-start
     for far_patch_embedding (trilinear interpolation from patch_embedding) moves
     into AnyFlowFARTransformer3DModel.__init__.

2. **chunk_partition** is no longer a model config field. The FAR pipeline owns
   the schedule:
   * AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]
     matches the released 81-frame NVIDIA checkpoints.
   * AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition
     argument that overrides the default for non-default num_frames.

3. **training_rollout -> _denoise_rollout** rename across both pipelines and all
   English / Chinese docs that referenced it. Signals the method is internal to
   the pipeline driver, not a public training API.

4. **Conversion script + tests + docs + registries**:
   * scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right
     transformer class per variant; init_far_model / init_flowmap_model /
     chunk_partition kwargs are removed from the from_pretrained call.
   * Transformer test file split into AnyFlowTransformer3DModelTest and
     AnyFlowFARTransformer3DModelTest classes.
   * Pipeline test fixtures use the right class and pass chunk_partition via
     get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test).
   * New docs page docs/source/en/api/models/anyflow_far_transformer3d.md;
     anyflow_transformer3d.md rewritten for the bidi-only class.
   * AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py,
     src/diffusers/models/__init__.py, models/transformers/__init__.py and the
     dummy_pt_objects.py stubs.
   * docs/source/en/_toctree.yml: new entry for the FAR transformer page.

5. **Cleanups**:
   * Pipeline __call__ no longer passes is_causal=False to the bidi forward (the
     bidi class doesn't accept it).
   * Pipeline class docstrings drop stale references to init_*_model flags.

Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes).
ruff check / format / check_copies clean.

Hub artifacts (model_index.json, transformer/config.json, scheduler config) need
to be regenerated for the released checkpoints; the HF update guide will be
delivered separately.
…models.md

Hard violations (per official diffusers guidelines):

* drop einops dependency — replace 25+ rearrange() calls with native
  permute/reshape/unflatten in transformer + both pipelines
* device-gate torch.float64 — apply_rotary_emb and AnyFlowRotaryPosEmbed now
  fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt
  per-device via _build_freqs (matches transformer_wan / transformer_flux
  pattern)
* migrate attention to dispatch_attention_fn — replace direct
  F.scaled_dot_product_attention calls with dispatch_attention_fn (works
  with sage / flash / native backends); introduce AnyFlowAttention(
  AttentionModuleMixin) with _default_processor_cls / _available_processors;
  rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and
  declare _attention_backend / _parallel_config class attrs
* drop dead config fields — qk_norm and added_kv_proj_dim are pruned from
  both transformer __init__ signatures and AnyFlowTransformerBlock;
  AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme
  the released checkpoints use) and has no add_k_proj path (T2V only)
* add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer
  classes for compile_repeated_blocks() support (matches Wan)
* annotate prepare_latents with `# Copied from diffusers.pipelines.wan.
  pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange
  to (B, T, C, H, W) layout is moved to the call site

State-dict keys are preserved (legacy Attention had identical to_q / to_k /
to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load
bit-exactly into the new AnyFlowAttention class.

The HF Hub config-update guide is updated correspondingly: transformer/
config.json now drops qk_norm and added_kv_proj_dim alongside the previous
init_far_model / init_flowmap_model / chunk_partition removals.

22 fast CPU tests still pass; ruff format / ruff check / check_copies all
clean.
…/head-dim fallbacks + KV-cache dtype + num_timesteps

Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR
causal path still calls flex_attention directly, which has hard requirements
(CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy
components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact
numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward
0.00e+00, backward kernel-nondet only, ratio 1.000).

Code fixes:

1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now
   short-circuit to an empty tensor when num_frames / height / width is 0.
   PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw
   spatial input becomes a 2x2 latent which then floors to 0 against
   compressed_patch_size=(1, 4, 4); the original
   `freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime.

2. flex_attention dispatch: split the module-load
   `torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager`
   (always available) plus `_flex_attention_compiled`, with a tiny wrapper
   that picks compiled for CUDA tensors and eager for CPU. Avoids
   torch._inductor C++ codegen failures that broke fast tests after
   `pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on
   bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd).

3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16
   (flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass
   `scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows
   contribute 0, so trimming the output back is mathematically equivalent.
   Released ckpts use head_dim=128 so the branch is never taken in production.

4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded
   `latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded
   bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and
   bias type (float) should be the same"); real bf16 ckpts are unaffected.

5. pipeline_anyflow_far._denoise_rollout sets
   `self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps`
   before the chunk loop, so PipelineTesterMixin.test_callback_cfg's
   `pipe.num_timesteps`-based assertion matches the actual number of callback
   fires (chunks * NFE) instead of the previous hardcoded num_inference_steps.
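
The equivalence claimed in item 3 can be checked numerically. The sketch below uses a plain-Python softmax attention as a stand-in for flex_attention (an assumption made purely so the example runs without a GPU or PyTorch); all helper names are hypothetical.

```python
import math


def attention(q, k, v, scale):
    """Single-head softmax attention over lists of row vectors."""
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v_row[d] for w, v_row in zip(weights, v)) / total for d in range(len(v[0]))]
        )
    return out


def zero_pad(rows, width):
    """Right-pad every row with zeros up to `width` entries."""
    return [row + [0.0] * (width - len(row)) for row in rows]


def padded_attention(q, k, v, min_head_dim=16):
    """Pad the head dim to `min_head_dim`, keep the original scale, trim the output."""
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)  # scale from the ORIGINAL head_dim, not the padded one
    out = attention(
        zero_pad(q, min_head_dim), zero_pad(k, min_head_dim), zero_pad(v, min_head_dim), scale
    )
    return [row[:head_dim] for row in out]


q = [[0.1, -0.2, 0.3, 0.4], [1.0, 0.5, -0.5, 0.0]]
k = [[0.2, 0.1, 0.0, -0.3], [-0.4, 0.6, 0.2, 0.1], [0.3, 0.3, 0.3, 0.3]]
v = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.5, 2.0], [2.0, 2.0, 2.0, 2.0]]

reference = attention(q, k, v, 1.0 / math.sqrt(4))
padded = padded_attention(q, k, v)
max_diff = max(
    abs(a - b) for ref_row, pad_row in zip(reference, padded) for a, b in zip(ref_row, pad_row)
)
```

Zero-padding q and k adds only zero terms to each dot product, so the attention weights are unchanged as long as the scale still uses the original head_dim. The padded value columns produce zeros that are simply trimmed away.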

Tests:

* test_callback_inputs cannot pass without changing FAR's chunk-wise output
  semantics — it zeroes latents on the final step and asserts the *entire*
  output buffer is zero, but only the active chunk's slice is overwritten in
  a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale;
  callback functionality itself is still covered by test_callback_cfg.
* Full pytest run on tests/pipelines/anyflow/ +
  tests/models/transformers/test_models_transformer_anyflow.py +
  tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed,
  0 failed, 11 skipped.

Quality gates:

* `ruff check` and `ruff format --check` clean across all AnyFlow files.
* `python utils/check_copies.py` clean.
* `python utils/check_dummies.py` clean.
User-facing alignment with the official HF Hub model card and the day-of-announcement
materials at https://huggingface.co/collections/nvidia/anyflow.

* Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries).
* Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers
  copy uses the same Video-to-Video terminology as the official model card.
* Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow)
  HF collection link to the three tutorial intros.
* Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page
  / ZH tutorial — the nvidia/AnyFlow-*-Diffusers repos are now live.
* Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project
  page) in place of the prior <github-org> / <project-page-url> placeholders.
* Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA
  affiliation in the main tutorial, API pipeline page, and both transformer
  model pages; BibTeX uses the standard `and others` to elide the full list
  until the next pass.

Working tree, CI gates, and tests after the change:

  ruff format --check                                  ✓
  ruff check                                           ✓
  python utils/check_copies.py                         ✓
  python utils/check_dummies.py                        ✓
  pytest tests/models + tests/schedulers (22 fast)     ✓

No production code logic changes — only docstring wording inside pipeline
files (TV2V → V2V).
Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and
Fang, Guian and others}, ...}`` block in both the English and Chinese
tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion,
...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors:
Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai,
Mike Zheng Shou.

Docs-only.
github-actions bot added the size/L (PR with diff > 200 LOC), documentation (Improvements or additions to documentation), models, tests, utils, pipelines, and schedulers labels and removed the size/L label on May 14, 2026
Enderfga and others added 2 commits May 14, 2026 20:57
Scheduler
- FlowMapEulerDiscreteScheduler.step now returns a
  FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False)
  and uses the conventional positional order (model_output, timestep, sample,
  r_timestep).
- Drop training-only helpers: adaptive_weighting, set_train_weight,
  get_train_weight, linear_timesteps_weights, and the weight_type config field.
- Add scale_model_input no-op for API parity; raise ValueError on missing
  r_timestep.
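
A toy version of the step contract listed above, for orientation only. The positional order, the `return_dict` switch, and the error on a missing `r_timestep` follow the bullets; the `StepOutput` class, its `prev_sample` field name, and the use of times in [0, 1] are assumptions rather than the scheduler's confirmed interface.

```python
from dataclasses import dataclass


@dataclass
class StepOutput:
    """Hypothetical stand-in for FlowMapEulerDiscreteSchedulerOutput."""

    prev_sample: list


def step(model_output, timestep, sample, r_timestep=None, return_dict=True):
    """Toy step using the positional order (model_output, timestep, sample, r_timestep)."""
    if r_timestep is None:
        raise ValueError("r_timestep is required: a flow-map step needs a target time")
    dt = r_timestep - timestep
    prev_sample = [x + dt * v for x, v in zip(sample, model_output)]
    if not return_dict:
        return (prev_sample,)  # tuple form, first element is the updated sample
    return StepOutput(prev_sample=prev_sample)
```

Callers that unpack positionally would use `step(..., return_dict=False)[0]`, while the dataclass form gives named access to the updated sample.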

Transformer
- Remove gate_track debug write inside
  AnyFlowDualTimestepTextImageEmbedding.forward_timestep.
- Compile flex_attention lazily on first CUDA call instead of at import time.
- Replace assert with ValueError in build_block_mask.
- Resolve <arxiv-id> placeholders to 2605.13724.

Pipelines (AnyFlowPipeline + AnyFlowFARPipeline)
- Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__
  docstrings covering every argument.
- Move use_mean_velocity from __init__ to __call__ so save/load round-trips.
- Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout),
  the inner inference_range closure, and the redundant negative-prompt concat.
- Replace asserts with ValueError; wire show_progress to tqdm; rename inference
  -> _inference; remove dead current_timestep property.
- Update scheduler.step call sites to the new signature.
- Trim class docstrings to inference-only language.

Pipeline output
- Add Apache 2.0 license header; switch to relative import.

Auto pipeline / conversion script
- Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and
  AUTO_VIDEO2VIDEO_PIPELINES_MAPPING.
- Document the weights_only=False requirement in the conversion script.

Tests
- Scheduler tests use the new step signature and verify the Output dataclass
  contract.
- Drop the four obsolete training-weight tests; drop weight_type kwarg from
  pipeline test fixtures; remove internal milestone names from TODO comments.

Docs
- Resolve <arxiv-id> in the scheduler docs page.
- Trim DMD / on-policy distillation language in EN/ZH tutorials and the
  pipelines page; the paper abstract quote is preserved verbatim.
@dg845 dg845 requested review from dg845 and yiyixuxu May 16, 2026 00:16
@dg845
Collaborator

dg845 commented May 21, 2026

@bot /style

@github-actions
Contributor

github-actions Bot commented May 21, 2026

Style bot fixed some files and pushed the changes.

Comment thread src/diffusers/models/transformers/transformer_anyflow.py Outdated
Comment thread src/diffusers/pipelines/anyflow/pipeline_anyflow.py Outdated
Comment thread src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py Outdated
Comment thread src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py Outdated
Comment thread src/diffusers/schedulers/scheduling_flow_map_euler_discrete.py
Comment thread src/diffusers/schedulers/scheduling_flow_map_euler_discrete.py
Comment thread tests/models/transformers/test_models_transformer_anyflow.py
Comment thread tests/models/transformers/test_models_transformer_anyflow.py
Comment on lines +145 to +149
# Torch-compile mixin intentionally skipped: FAR's `_build_causal_mask` uses
# `flex_attention.create_block_mask(_compile=False)`, which conflicts with the tracer
# assumptions made by the standard TorchCompileTesterMixin. The bidi transformer test file
# covers compile behavior; the FAR causal path is bit-exact-validated end-to-end on H200
# through the pipeline replay rather than per-module compile.
Collaborator


Suggestion (non-blocking): my understanding is that the underlying cause of the incompatibility between AnyFlowFARTransformer3DModel and TorchCompileTesterMixin is that AnyFlowFARTransformer3DModel.forward calls torch.nn.attention.flex_attention.create_block_mask (via _build_causal_mask) internally. Since _build_causal_mask doesn't depend on the transformer internals, we could refactor it into a standalone function, build the attention mask outside of the transformer forward method (e.g. in AnyFlowFARPipeline.__call__), and then pass it to forward via an attention_mask: BlockMask argument. This should allow pipe.transformer.compile() (and the compile tests) to work as expected.

Contributor Author


Thanks — this is a good direction. Since you marked it non-blocking and it reshapes the transformer's public attention_mask contract (the pipeline becomes responsible for building the BlockMask, which needs another bit-exact pass to validate), I'd like to defer it to a focused follow-up PR that pairs the _build_causal_mask extraction with re-enabling TorchCompileTesterMixin on the FAR transformer — that way the optimization and its dedicated test land in one go. Will track it as a TODO post-merge.

Collaborator

@dg845 dg845 left a comment


Thanks for the changes! I think this PR is close to merge. Left some small comments and suggestions.

@Enderfga
Contributor Author

Thanks for the careful third pass @dg845 — happy to hear we're close. Working through all 9 now; will reply per-thread as each lands. Should be done in ~1h.

…imesteps schedule

dg845 third pass — 7 of 9 comments applied; the 8th (custom sigmas/timesteps support)
matches FlowMatchEulerDiscreteScheduler conventions; the 9th (_build_causal_mask
refactor) is explicitly marked non-blocking and deferred to a follow-up that also
re-enables TorchCompileTesterMixin.

Comment cleanups:
- transformer_anyflow.py:704 temb output-norm comment: drop redundant 'no ndim==2 branch'.
- pipeline_anyflow.py:550 denoise loop comment: '# 6. Denoising loop'.
- pipeline_anyflow_far.py:684 denoise loop comment: '# 8. Denoising loop (outer over
  chunks, inner over timesteps).'.
- pipeline_anyflow_far.py:702 drop trailing inline comment on `timesteps = scheduler.timesteps`.
- scheduling_flow_map_euler_discrete.py: clearer wording on the off-schedule `r_timestep`
  error.

Custom schedule support:
- FlowMapEulerDiscreteScheduler.set_timesteps gains `sigmas` and `timesteps` kwargs
  mirroring FlowMatchEulerDiscreteScheduler. Default behaviour is unchanged
  (linspace + shift); the validation + length-N → length-N+1 terminal-0 append are
  shared with the default path so on-schedule rollouts stay bit-exact.
- AnyFlowPipeline.__call__ and AnyFlowFARPipeline.__call__ accept `sigmas` and
  `timesteps` kwargs, override num_inference_steps from their length, and forward
  to set_timesteps (matches LTX2Pipeline pattern).
- New scheduler tests: test_set_timesteps_custom_sigmas and
  test_set_timesteps_custom_timesteps cover both override paths.
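
The custom-schedule behaviour described above can be illustrated with a minimal pure-Python sketch. `build_sigma_schedule` is a hypothetical helper, not the scheduler's code: the "strictly decreasing, positive" validation is an assumption, and the default branch borrows the shift formula used by FlowMatchEulerDiscreteScheduler as a stand-in for the "linspace + shift" path. What it does show is the part stated in the commit message: custom sigmas of length N override the step count, and both paths share the same terminal-0 append that yields N+1 values.

```python
def build_sigma_schedule(num_inference_steps=None, sigmas=None, shift=1.0):
    """Return (full_sigmas, num_inference_steps) for a default or custom schedule."""
    if sigmas is None:
        if num_inference_steps is None:
            raise ValueError("Pass either num_inference_steps or sigmas.")
        n = num_inference_steps
        # Stand-in default: evenly spaced from 1.0 down to 1/n, then shifted.
        base = [1.0 - i / n for i in range(n)]
        working = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    else:
        working = [float(s) for s in sigmas]

    # Shared validation (assumed rules): non-empty, positive, strictly decreasing.
    if not working:
        raise ValueError("Schedule must contain at least one sigma.")
    if working[-1] <= 0.0:
        raise ValueError("The terminal 0 is appended automatically; omit it.")
    if any(a <= b for a, b in zip(working, working[1:])):
        raise ValueError("Sigmas must be strictly decreasing.")

    # Length N -> length N+1: append the terminal sigma of 0.
    return working + [0.0], len(working)


custom_full, custom_steps = build_sigma_schedule(sigmas=[1.0, 0.5, 0.25])
default_full, default_steps = build_sigma_schedule(num_inference_steps=4)
print(custom_full, custom_steps)
print(default_full, default_steps)
```

Routing both branches through the same validation and append is one way to keep the default path numerically untouched when the override is added.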

Dtype skip on save/load:
- TestAnyFlowTransformer3D and TestAnyFlowFARTransformer3D now skip
  test_from_save_pretrained_dtype_inference (parametrized over fp16/bf16), mirroring
  WanTransformer3DModel's skip — the test's tolerance requirements are too high for
  meaningful signal under AnyFlow's flow-map mixed-precision sampling.
@Enderfga
Contributor Author

Third pass addressed in 04c8621, per-thread replies inline. Recap:

| # | Item | Outcome |
|---|---|---|
| 1–5, 7, 8 | Comment cleanups + `test_from_save_pretrained_dtype_inference` skip + `import pytest` | Applied verbatim |
| 6 | Custom sigmas / timesteps schedule | `set_timesteps` now accepts both (matches FlowMatchEulerDiscreteScheduler); both pipelines expose `sigmas` and `timesteps` kwargs in `__call__` (matches LTX2Pipeline). Default path is bit-identical (max abs diff = 0.0) so existing rollouts are unchanged. |
| 9 | `_build_causal_mask` standalone refactor | Marked non-blocking; deferring to a follow-up PR that also re-enables TorchCompileTesterMixin on the FAR transformer, so the optimization and its test land together. |

Test sweep: 88 passed / 83 skipped / 0 failed on CPU. make fix-copies, ruff format / check, check_forward_call_docstrings all clean.

Ready for another look.

Enderfga and others added 2 commits May 22, 2026 15:25
CI doc-builder style check flagged 3 files with docstring lines >119 chars.
Ran 'doc-builder style src/diffusers docs/source --max_len 119' to autoformat;
content unchanged, line wrapping only.
Comment thread src/diffusers/schedulers/scheduling_flow_map_euler_discrete.py Outdated
@dg845
Collaborator

dg845 commented May 22, 2026

Hi @claude, can you do a final review of this PR?

@github-actions
Contributor

github-actions Bot commented May 22, 2026

Claude finished @dg845's task in 2m 19s


Final Review of AnyFlow PR

  • Read review guidelines (AGENTS.md, models.md, pipelines.md, modular.md)
  • Review transformer_anyflow.py (bidirectional transformer)
  • Review transformer_anyflow_far.py (FAR causal transformer)
  • Review pipeline_anyflow.py (bidirectional pipeline)
  • Review pipeline_anyflow_far.py (FAR causal pipeline)
  • Review scheduling_flow_map_euler_discrete.py (scheduler)
  • Review pipeline output, __init__ files, and auto_pipeline registration
  • Review tests
  • Trace call paths for dead code analysis
  • Post final review

Hi @dg845 — here's my final review. The PR is in good shape after three review rounds. Most of the structural and API items from your earlier reviews have been cleanly addressed. I found a few remaining items — one is a pending suggestion from you that hasn't been applied yet, and the rest are minor.

Blocking

1. Unaddressed review suggestion: torch.zeros(1) → working_sigmas.new_zeros(1) in scheduler

Your most recent suggestion on scheduling_flow_map_euler_discrete.py:185 hasn't been applied yet:

# Current:
full_sigmas = torch.cat([working_sigmas, torch.zeros(1, dtype=working_sigmas.dtype)])

# Should be:
full_sigmas = torch.cat([working_sigmas, working_sigmas.new_zeros(1)])

You noted this prevents test failures on CUDA in tests/pipelines/anyflow/test_anyflow_far.py because new_zeros() preserves both device and dtype of working_sigmas. Since working_sigmas is always on CPU at this point (built via torch.tensor(..., dtype=torch.float64)), the device mismatch is currently latent — but new_zeros is the correct defensive pattern and matches how other schedulers handle this.
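
A small sketch of the difference, assuming only that PyTorch is installed. The sigma values are made up for illustration. On a CPU-only run both forms give the same result, which is why the mismatch stays latent; the two only diverge once `working_sigmas` lives on another device.

```python
import torch

# Illustrative values only; the real schedule comes from set_timesteps.
working_sigmas = torch.tensor([1.0, 0.75, 0.5, 0.25], dtype=torch.float64)

# Explicit dtype, implicit device: the zero is created on the default device,
# so torch.cat would fail if working_sigmas were on a different device.
explicit = torch.cat([working_sigmas, torch.zeros(1, dtype=working_sigmas.dtype)])

# new_zeros copies both dtype and device from the source tensor.
inherited = torch.cat([working_sigmas, working_sigmas.new_zeros(1)])

print(inherited.dtype, inherited.device, torch.equal(explicit, inherited))
```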

Non-blocking

2. _forward_far_patchify and _forward_far_patchify_inference are still public

In transformer_anyflow_far.py:1004 and transformer_anyflow_far.py:1028, these methods are only called internally by _forward_train/_forward_cache and _forward_inference respectively. The previous review flagged this as needing to be private (_forward_far_patchify / _forward_far_patchify_inference). The method names haven't been updated. (Low priority — not user-facing API.)

3. encode_video method is duplicated across both pipelines without # Copied from

Both pipeline_anyflow.py:359 and pipeline_anyflow_far.py:384 have identical encode_video implementations. Per pipelines.md gotcha #5, when you reuse a method from another pipeline, add a # Copied from annotation so make fix-copies keeps them in sync. Consider annotating the FAR pipeline's encode_video with:

# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video

4. check_inputs is duplicated across both pipelines without # Copied from

Similarly, check_inputs in both pipelines is nearly identical (the FAR version adds the (num_frames - 1) % 4 check). The shared base could be linked via # Copied from with the FAR-specific addition. This is a minor consistency nit.

5. FAR pipeline's encode_kv_cache — consider _encode_kv_cache

Per pipelines.md gotcha #6: "If a method is only used by another method, make it private". encode_kv_cache at pipeline_anyflow_far.py:401 is only called from within __call__ and isn't part of the pipeline lifecycle that users would call independently. Making it _encode_kv_cache would signal this more clearly. (Though the argument for keeping it public is that users doing advanced inference might want to call it — so this is judgment-call territory.)

6. Minor: # step 3: comments left in FAR transformer internal methods

In transformer_anyflow_far.py:1221, transformer_anyflow_far.py:1341, and transformer_anyflow_far.py:1441, there are leftover # step 3: generate attention mask comments. These are remnants from the original numbered step structure and don't match any current numbering scheme. They're harmless but slightly confusing.
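
For item 4 above, the FAR-specific constraint can be pictured with a short sketch. `check_num_frames` is a hypothetical helper rather than the pipeline's `check_inputs`; raising (instead of warning or rounding) and the latent-frame formula are assumptions, based on a 4x temporal compression in which the first frame is encoded on its own.

```python
def check_num_frames(num_frames, temporal_factor=4):
    """Validate num_frames and return the implied number of latent frames.

    Assumes the frame count must have the form temporal_factor * k + 1.
    """
    if num_frames < 1 or (num_frames - 1) % temporal_factor != 0:
        raise ValueError(
            f"num_frames must be of the form {temporal_factor} * k + 1, got {num_frames}."
        )
    return (num_frames - 1) // temporal_factor + 1


print(check_num_frames(81))

try:
    check_num_frames(80)
except ValueError as err:
    print(err)
```

A check like this has no counterpart in the shared validation, which is the reason given later in the thread for not linking the two `check_inputs` methods with `# Copied from`.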

Suggestions / additional info: dead code trace

I traced the call paths from both pipelines through the transformers to identify likely dead code.

Bidirectional pipeline → AnyFlowTransformer3DModel:

  • The pipeline never passes encoder_hidden_states_image (always None). The image embedding path at line 690-691 (if encoder_hidden_states_image is not None) is therefore unreachable via the current pipeline. This is expected — I2V conditioning may be added later or users may call the model directly.
  • attention_kwargs is consumed by @apply_lora_scale but not forwarded further into the attention processors. This is consistent with how WanTransformer3DModel handles it — fine as-is.

FAR pipeline → AnyFlowFARTransformer3DModel:

  • The pipeline only uses _forward_cache (via encode_kv_cache) and _forward_inference (via the denoising loop). _forward_train is never called by the pipeline — expected, it's for training.
  • clean_hidden_states and clean_timestep are only used in _forward_train and _forward_cache. The pipeline passes them as None from both call sites (denoising loop and encode_kv_cache), so the clean_hidden_states is not None branches in _forward_cache and _build_causal_mask are unreachable during pipeline inference. Expected — training code path.
  • encoder_hidden_states_image is similarly never passed by the pipeline — the I2V image embedding concatenation path at _forward_inference:1237 and _forward_cache:1361 is unreachable via the current pipeline.

All of these are fine as advisory — they're either training paths or extension points for future I2V support. No action needed.

Summary

The PR is well-structured and the code quality is high. The transformer split is clean, the # Copied from annotations are correctly used, the scheduler API is well-designed with proper _step_index/_begin_index state machine support, and the test coverage is solid. The main remaining item is the scheduler new_zeros fix from your latest review, plus the minor items above. This looks close to merge.

…anup

dg845 blocking suggestion (r3287274209):
- scheduling_flow_map_euler_discrete.py:185 — use `working_sigmas.new_zeros(1)`
  instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits
  both device and dtype from working_sigmas. The current working_sigmas always
  starts on CPU so the device mismatch is latent, but new_zeros is the correct
  defensive pattern and matches how the published FAR test fixtures run on CUDA.

Claude bot final-review follow-ups:
- transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask`
  comments left over from the original numbered-step structure (bot #6).
- pipeline_anyflow_far.py: annotate `encode_video` with
  `# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video`
  and align docstring + inline comment so `make fix-copies` keeps them in sync (bot #3).

Skipped (not real / judgment-call):
- bot #2 (private rename of `_forward_far_patchify*`): already done in 84605d5;
  bot was looking at a stale snapshot.
- bot #4 (check_inputs `# Copied from`): FAR's check_inputs has an extra
  `(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version,
  so a clean `# Copied from` link would require restructuring. Bot called it a
  consistency nit; leaving as-is.
- bot #5 (`encode_kv_cache` → `_encode_kv_cache`): bot itself flagged this as
  judgment-call territory; the helper is a coherent operation that advanced
  inference callers may want to invoke directly.
Collaborator

@dg845 dg845 left a comment


Thanks for your hard work on this PR!

@Enderfga
Contributor Author

Thanks so much for the careful, patient guidance across all four review rounds, @dg845 — really appreciate the time you put in. Excited to see AnyFlow land! 🚀

@dg845
Collaborator

dg845 commented May 22, 2026

Merging as the CI failures are unrelated.

@dg845 dg845 merged commit e39aecf into huggingface:main May 22, 2026
13 of 15 checks passed
Enderfga added a commit to Enderfga/AnyFlow that referenced this pull request May 22, 2026
VideoProcessor.preprocess_video's 5D contract is (B, T, C, H, W) — the
diffusers AnyFlow PR aligned its docstring + EXAMPLE_DOC_STRING with this
in the third review pass (huggingface/diffusers#13745, commits ffdc969
and downstream). This README's I2V example still showed (B, C, T, H, W)
and the matching unsqueeze(2); update both so users following the README
verbatim get a tensor the diffusers pipeline accepts.
Enderfga added a commit to Enderfga/AnyFlow that referenced this pull request May 22, 2026
…Flow classes

The training pipeline (far/main.py:save_checkpoint) emits .pt files keyed by
'ema' / 'model_state_dict_g'; the diffusers pipelines load from a structured
directory written by pipeline.save_pretrained(). Until now this conversion
script wrapped the .pt into a pipeline built from this repository's
WanAnyFlowPipeline / FARWanAnyFlowPipeline / FAR_Wan_Transformer3DModel /
FlowMapDiscreteScheduler — so the resulting model_index.json referenced
far.* paths that diffusers.from_pretrained couldn't resolve.

Switch the conversion to the diffusers AnyFlow classes (introduced in
huggingface/diffusers#13745):

  - AnyFlowTransformer3DModel        (bidirectional T2V variants)
  - AnyFlowFARTransformer3DModel     (FAR causal variants)
  - AnyFlowPipeline / AnyFlowFARPipeline
  - FlowMapEulerDiscreteScheduler

Output directories now load via AnyFlowPipeline.from_pretrained(...) with
no compat shim. The CLI surface (OmegaConf model_type / model_path /
model_save_dir keys + auto-append of model_type to the save dir) is
preserved.

Tensor keys are unchanged across FAR_Wan_Transformer3DModel and the
diffusers AnyFlow classes (bit-exact L2=0 against the released NVlabs
checkpoints), so load_state_dict(strict=False) handles the EMA bookkeeping
fields without dropping any real weights.
Labels

documentation, models, pipelines, schedulers, size/L, tests, utils

3 participants